Molecular structural dataset of lignin macromolecule elucidating experimental structural compositions

Lignin is one of the most abundant biopolymers in nature and has great potential to be transformed into high-value chemicals. However, the limited availability of molecular structure data hinders its potential industrial applications. Herein, we present the Lignin Structural (LGS) Dataset that includes the molecular structure of milled wood lignin focusing on two major monomeric units (coniferyl and syringyl), and the six most common interunit linkages (phenylpropane β-aryl ether, resinol, phenylcoumaran, biphenyl, dibenzodioxocin, and diaryl ether). The dataset constitutes a unique resource that covers a part of lignin’s chemical space characterized by polymer chains with lengths in the range of 3 to 25 monomer units. Structural data were generated using a sequence-controlled polymer generation approach that was calibrated to match experimental lignin properties. The LGS dataset includes 60 K newly generated lignin structures that match with high accuracy (~90%) the experimentally determined structural compositions available in the literature. The LGS dataset is a valuable resource to advance lignin chemistry research, including computational simulation approaches and predictive modelling.


Method details
Lignin structure generation tool architecture and implementation
The Lignin Structure (LGS) generator tool was developed as a standalone utility for computing the network of lignin molecular structures and defining a large set of lignin molecules. The tool is implemented in core Java. Its major functionality includes: a modified version of Heap's algorithm for enumerating permutations of lignin monomers; directed graph creation using a combination algorithm, together with creation of topological matrices; integration of the CDK (Chemistry Development Kit) for molecular structure generation and creation of SMILES notation for the generated structures; storage of molecular information as MDL Mol files; and evaluation of structural features, which are stored as a JSON file. The software architecture of the lignin structure generator tool is provided in Figure A. The tool can be used to generate different structural variations for a given set of experimental observations by configuring the required parameters, such as monomer ratio (S, G and H) and bond frequencies (β-O-4, β-β, β-5, 4-O-5, 5-5 and DBDO), in the project configuration file (projectconfig.yaml). The tool can generate chemically correct and legible 2D structure diagrams of natural lignin for all wood types, including hardwood, softwood and herbaceous.

Molecular structure formation using CDK
CDK is a widely used open-source cheminformatics toolkit [1]. It provides data structures to represent chemical concepts, along with methods to manipulate such structures and perform computations on them. The monomer descriptors were initialized as IAtomContainer objects from CDK, and linkages between monomer units were generated as IBond objects using the edge definitions from the directed graphs. The class MonolignolBase is defined as an abstract class that initializes the template of a monomer object containing the phenylpropane unit. The class relation diagram in Figure B shows the CDK integration for molecular structure creation.
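To make the permutation step in the tool description more concrete, the following is a minimal Python sketch of the standard (unmodified) iterative Heap's algorithm applied to a list of monomer labels. The tool itself is written in Java and uses a modified variant whose details are not given here, so the function names `heap_permutations` and `unique_sequences`, and the deduplication step for repeated monomer types, are assumptions made for illustration only.

```python
from typing import Iterator, List, Set, Tuple


def heap_permutations(items: List[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every ordering of `items` using iterative Heap's algorithm."""
    a = list(items)
    n = len(a)
    c = [0] * n  # c[i] counts how many swaps have been done at level i
    yield tuple(a)
    i = 1
    while i < n:
        if c[i] < i:
            # Even level: swap with position 0; odd level: swap with c[i].
            j = 0 if i % 2 == 0 else c[i]
            a[i], a[j] = a[j], a[i]
            yield tuple(a)
            c[i] += 1
            i = 1
        else:
            c[i] = 0
            i += 1


def unique_sequences(monomers: List[str]) -> List[Tuple[str, ...]]:
    """Assumed post-processing: drop orderings that repeat because
    the same monomer type (e.g. "G") occurs more than once."""
    seen: Set[Tuple[str, ...]] = set()
    result: List[Tuple[str, ...]] = []
    for sequence in heap_permutations(monomers):
        if sequence not in seen:
            seen.add(sequence)
            result.append(sequence)
    return result


if __name__ == "__main__":
    # Hypothetical tetramer with two G and two S units.
    for sequence in unique_sequences(["G", "G", "S", "S"]):
        print("-".join(sequence))
```

Heap's algorithm produces each new ordering with a single swap, which keeps enumeration cheap per step; the total still grows factorially with chain length, so any practical generator needs some form of pruning or sampling for longer chains.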

Data Records
Summary of the dataset

Structure Visualization in 2D and 3D representations
Lignin structures included in the LGS dataset can be visualized as 2D and 3D representations. Figure C shows 2D and 3D visualizations of G and SG type structures using CDK and Avogadro, respectively.

TMAP visualization of SG type structures
Figure D shows the SG type structural dataset visualized using the Tree MAP (TMAP) algorithm.

Figure D: TMAP visualization for SG type structures. Color is based on the free phenolic-OH groups in the molecule. Clicking on an individual data point (circle or leaf) in the tree view displays compound information detailing the structural features, along with a link to the 3D view.
Data format specifications:

SMILES notation
The Simplified Molecular-Input Line-Entry System (SMILES) [2, 3] is a line notation for describing chemical structures using short ASCII strings. Lignin structures are stored as SMILES strings of the generated molecules, since SMILES is a key asset in cheminformatics and is becoming increasingly relevant to the general chemical community owing to the steadily growing impact of Big Data and Machine Learning. An example of the SMILES notation is shown in Table B (Lignin Oligomer).

JSON format
We used JSON (JavaScript Object Notation) to integrate the structural features of all possible permutations for a given degree of polymerization (DP). JSON is a widely used data-interchange format that stores data as key/value pairs. The JSON data definition used in this study is presented in Figure E (JSON file definition for G type structures with a DP of 4). The JSON data file can be imported directly into a database such as MongoDB for easy data analysis. Each file provides the catalog of structural information for a specific DP. The lignin id (lig_id) in the JSON object is the unique id used to locate the properties of a specific structure in the MOL and CSV files.
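As an illustration of how the lig_id key can be used as a lookup handle, the sketch below loads a JSON catalog and indexes its records by lig_id. Only the `lig_id` field name comes from the text; the record layout (a top-level list), the `dp` and `smiles` field names, the sample values, and the helper `index_by_lig_id` are hypothetical placeholders, and the actual schema is the one shown in Figure E.

```python
import json
from typing import Any, Dict, List

# Placeholder catalog: only `lig_id` is a field name taken from the text.
# `dp` and `smiles` are hypothetical fields used purely for illustration.
SAMPLE_CATALOG = """
[
  {"lig_id": "example-1", "dp": 4, "smiles": "<SMILES string>"},
  {"lig_id": "example-2", "dp": 4, "smiles": "<SMILES string>"}
]
"""


def index_by_lig_id(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Build a lookup table keyed by lig_id, rejecting duplicate ids."""
    index: Dict[str, Dict[str, Any]] = {}
    for record in records:
        lig_id = record["lig_id"]
        if lig_id in index:
            raise ValueError(f"duplicate lig_id: {lig_id}")
        index[lig_id] = record
    return index


records = json.loads(SAMPLE_CATALOG)
catalog = index_by_lig_id(records)

if __name__ == "__main__":
    # The same id would then be used to find the matching MOL and CSV entries.
    print(catalog["example-1"])
```

Checking for duplicate ids at load time is a cheap safeguard when the id is later used to join records across the JSON, MOL and CSV files.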

Matrices:
The matrices are a machine-readable form of the molecular structure, representing linkages between monomers as adjacency and connectivity matrices. The adjacency matrix records the presence of a linkage between monomers, and the connectivity matrix records the type of linkage between them. An example of a CSV file and its field definitions is given in Figure F.

Molecular data file:
MOL files are plain-text data files that contain molecular information: atoms, bonds, coordinates, and connectivity. The format was developed and published by Molecular Design Limited (MDL). MDL molfile version V3000 files are generated for the lignin structures using CDK (Chemistry Development Kit). V3000 [4] is used for representing protein and polymer structures. The Avogadro software can be used on Microsoft Windows, Linux and Mac OS to access and view MOL files. An example of a MOL file is provided in Figure G.
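Returning to the adjacency and connectivity matrices described under Matrices, the following sketch shows one way such a pair could be built from a directed edge list of monomer-to-monomer linkages. The integer codes in `LINKAGE_CODES`, the `Edge` tuple layout, and the function `build_matrices` are assumptions for illustration; the encoding actually used in the dataset's CSV files is the one defined in Figure F and may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical integer codes for the six linkage types named in the text.
LINKAGE_CODES: Dict[str, int] = {
    "β-O-4": 1,
    "β-β": 2,
    "β-5": 3,
    "4-O-5": 4,
    "5-5": 5,
    "DBDO": 6,
}

# (source monomer index, target monomer index, linkage type)
Edge = Tuple[int, int, str]
Matrix = List[List[int]]


def build_matrices(n_monomers: int, edges: List[Edge]) -> Tuple[Matrix, Matrix]:
    """Return (adjacency, connectivity) for a directed monomer graph.

    adjacency[i][j] is 1 if monomer i links to monomer j, else 0.
    connectivity[i][j] holds the assumed code of that linkage, else 0.
    """
    adjacency = [[0] * n_monomers for _ in range(n_monomers)]
    connectivity = [[0] * n_monomers for _ in range(n_monomers)]
    for source, target, linkage in edges:
        adjacency[source][target] = 1
        connectivity[source][target] = LINKAGE_CODES[linkage]
    return adjacency, connectivity


if __name__ == "__main__":
    # Hypothetical linear trimer: unit 0 -(β-O-4)-> unit 1 -(β-5)-> unit 2.
    adj, con = build_matrices(3, [(0, 1, "β-O-4"), (1, 2, "β-5")])
    for row in adj:
        print(row)
    for row in con:
        print(row)
```

Keeping presence and type in two separate matrices means the adjacency matrix can be used directly for topology checks, while the connectivity matrix carries the chemistry-specific labels.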